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VISUAL TUNING INTERFACE FOR A TEXT-TO-SPEECH ENGINE 

FIELD OF THE INVENTION 
[0001] The present invention generally relates to speech synthesis and 
in particular to the tuning of the text-to-speech conversion process. 

BACKGROUND OF THE INVENTION 

[0002] Communicating with computers using speech as a medium 
remains an open-ended pursuit for the research community. Flawless speech-to- 
speech communication between a user and a computer remains a long-term 
goal. At present, however, text-to-speech conversion is one area of speech 
synthesis that has received considerable commercial attention. In such text-to- 
speech conversion process, a user supplies text as an input to a computer, and 
then the computer outputs a speech equivalent to the entered text in a spoken 
(audio) form. Typically, a software engine drives the process of converting text- 
to-speech. The actual audio is produced by using widely available sound-cards. 

[0003] Several applications that process routine user-queries or make 
announcements use the technique of text-to-speech conversion. For example, 
announcements within trains or on train-stations, searching a company telephone 
directory, querying bank account balances, announcing waiting-times in a 
dynamic manner, etc. A popular use of text-to-speech systems is in call-center 
operations. While a large number of text-to-speech conversion systems are used 
in the telephone-based query setup, other non-telephone based applications also 
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exist. Customization of the text-to-speech systems for various applications is 
described next. 

[0004] Text-to-speech conversion, though automatic in operation, can 
require customization depending upon the needs of a given application. For 
example, in a typical telephone based bank-account query system that informs 
the account holder about the current balance of an account, the system must 
pronounce the balance information precisely and slowly. However, in other text- 
to-speech systems, such as a phone-based airport information query system, it 
would be desirable to have the system quickly announce the list of all delayed 
flights on a given day to avoid long wait-times for other callers. In other words, 
the text-to-speech process needs to be customized, depending upon the 
requirements of the particular application, either to produce fast or slow-paced 
speech output. The pace of speech output is but one of many parameters of the 
text-to-speech conversion systems that need to be customized. Hence, there is 
a need for a customizable or a tunable text-to-speech conversion system. 

[0005] A typical way of customizing a text-to-speech system is to 
manually insert control tags or commands in the text input file that is fed to a text- 
to-speech conversion engine. The control tags will typically modify the speech 
output in a number of ways such as pronouncing certain words fast or slow, 
controlling the pause interval between selected words, etc. However, this 
approach presents several problems. First, customization of input text with 
control tags will require a person of considerable training to insert the control tags 
in the text input at proper places to achieve the required speech modulation. 
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Second, entering control tags intermingled with the basic text is a non-intuitive 
and certainly not a user-friendly way of modifying the speech output. Third, even 
for a person of considerable training, it will be inefficient to edit the text file, edit 
the control tags, listen to the output, and repeat the process until the required 
output is achieved. Hence, there is a need for a user-friendly technique for 
modulating the speech output produced by a text-to-speech conversion system. 

SUMMARY OF THE INVENTION 

A system for tuning the text-to-speech conversion process is described. 
The system includes a text-to-speech engine that converts the input text into a 
processed form of Parameterized Aligned Sound Records (PASR) format. The 
PASR format includes speech features of the text input. A visual editing interface 
displays the text with speech features being represented as visual indicators 
such as font, color, spacing, bold, italic, etc. The user can edit the text and the 
visual indicators to modify the underlying speech features of the text. The user 
can generate the speech audio to test the text-to-speech conversion, and repeat 
the editing-testing process till a desired speech output is achieved. User can 
save the processed text in a database and retrieve the same lateron. 

[0006] Further areas of applicability of the present invention will 
become apparent from the detailed description provided hereinafter. It should be 
understood that the detailed description and specific examples, while indicating 
the preferred embodiment of the invention, are intended for purposes of 
illustration only and are not intended to limit the scope of the invention. 
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BRIEF DESCRIPTION OF THE DRAWINGS 
[0007] The present invention will become more fully understood from 
the detailed description and the accompanying drawings, wherein: 

[0008] Fig. 1 is a system overview diagram for the visual tuning of text- 
to-speech conversion process employed in the present invention; 

[0009] Fig. 2 shows a representation of the PASR format conversion 
process. 

[0010] Fig. 3 shows an exemplary GUI editor; 

[0011] Fig. 4 is a graphical representation of the design of the visual 
tuning system according to the principle of the present invention; and 

[0012] Fig. 5 shows the relation between the design of the tuning 
system and the GUI editor. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 
[0013] The following description of the preferred embodiment(s) is 
merely exemplary in nature and is in no way intended to limit the invention, its 
application, or uses. 

[0014] Fig.1 is a system overview diagram for the visual tuning of the 
text-to-speech conversion process employed in the present invention. The 
Visual Text-to-Speech (TTS) tuning system 10 starts the tuning process with a 
user 12 supplying raw text, e.g., ASCII or Unicode encoded text, to a TTS engine 
16. The raw text is plain simple text without any speech modulation tags or 
commands. The raw text can be entered either through a Graphical User 
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Interface (GUI) for entering text (not shown) or as a simple text file. The user 12 
can supply raw-text to the TTS engine 16 by using any available technique. 
Those skilled in the art will appreciate that the manner or format in which the user 
12 supplies raw text to the TTS engine 16 does not limit the invention. The 
interaction of the TTS engine 16 and a GUI editor 14 is described next. 

[0015] The TTS engine 16 receives the raw text from the user 12 and 
converts it internally to normalized text, because the input text can contain some 
unpronounceable characters or terms like dates, dollar amounts, etc. The TTS 
engine 16 includes a module called text normalizer (not shown) that expands 
unpronounceable character strings into pronounceable words. For example, the 
text normalizer will expand the string "10/25/1995" to the string "October twenty 
fifth nineteen ninety five". The output of the normalizer is called normalized text 
and each word from the normalized text is a normalized word. After converting 
the input text into normalized text the PASR format of the input text is generated 
by the TTS engine 16. PASR format is the processed representation of the input 
text that will be used by the GUI editor 14. 

[0016] The user 12's interaction with the GUI editor 14 is described 
next. The GUI editor 14 displays the PASR data received from the TTS engine 
16. The displayed data inside the GUI editor 14 includes visual representation of 
speech features as described in detail further below. The user 12 views the 
PASR data in the GUI editor 14 and then repeats the cycle of editing and 
listening until the desired audio reproduction of the text-to-speech conversion is 
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achieved. Thereafter, the user 12 can choose to store the edited text in the GUI 
editor 14. 

[0017] The TTS engine 16 produces a particular type of speech output 
that is more suitable to visual editing. The TTS engine 16 reports the origin of 
the transcription to the GUI editor 14. Hence, the GUI editor 14 can determine if 
a particular transcription of the word is the result of the TTS engine 16's 
processing or if it was supplied by the user. The phonetic transcription is a string 
of the phonemes that specify how the word should be pronounced. For 
example, for the word "ghost", one possible transcription will be "g ow s t". For 
example in some dialects of English the word "news" can be pronounced as "n 
uw z", in some others as "n y uw z". For the purposes of illustration, it is assumed 
that the default transcription is "n uw z" and that a user has supplied a user- 
defined transcription "n y uw z" for that word. The TTS engine 16 will recognize 
the user-defined transcription and will report to the GUI editor 14 the origin as 
defined by the user. The text will be synthesized according to the user's 
transcription, with the word "news" being pronounced as "n y uw z". The details 
of the GUI editor 14's structure and function are described next 

[0018] PASR format includes the normalized text produced by the TTS 
engine 16. It also includes aligned with the normalized text, the TTS parameters 
which were used to generate the synthesized sound. The PASR format can 
accommodate parameters for each normalized word and word-boundary. For 
each normalized segment like a word in text, properties that can be associated 
with graphic indicators are synthesized speech, normalized text, phonetic 
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transcription, prominence and relative speed. Synthesized speech is the audible 
representation of the word in some popular sound format. For example, the 
sound format can be PCM, 11 Khz, 16-bit, mono. Prominence denotes how 
important a particular word is in a given sentence. Usually, the higher 
prominence value is, the greater is the energy, the longer is the duration and the 
greater is the pitch variation that are associated with it. For each boundary the 
properties that can be associated with graphic indicators are synthesized 
waveform, boundary strength and pause length. Hence, within the PASR format 
each word or word-boundary can be displayed and modified in an independent 
manner. 

[0019] Fig. 2 shows a representation of the PASR format conversion 
process. The interface between the GUI editor 14 and the TTS engine 16 is 
implemented via PASR formatted text 17 as the TTS engine 16's input and 
PASR data as the TTS engine 16's output. PASR formatted text 17 is the textual 
representation of the PASR data, which can be directly generated from the PASR 
data by writing out the properties associated with each individual word or 
boundary into a text string using the TTS tag format. 

[0020] The PASR formatted text 17 can be passed through the TTS 
engine 16 multiple times without any change caused by the TTS engine 16, 
unlike the raw text that can undergo modification when passed through the TTS 
engine 16. This transitive closure guarantees that the PASR formatted text will 
stay unchanged irrespective of the number of times it is passed through the TTS 
engine. Therefore, the PASR formatted text can be stored in a database and can 
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be used to regenerate the same sound. Further, text edited through the GUI 
editor 14 can be used to generate a waveform by using a different TTS engine 
(not shown) that uses the same tag format as the TTS engine 1 6. Hence, the 
TTS engine 16 generates PASR data and supplies it as an input to the GUI editor 
14. 

[0021] Fig. 3 shows an exemplary GUI editor 14. The system 10 
provides a tool that functions like visual interface to the TTS engine. The visual 
interface tool provides multi-channel communication with the TTS engine, the 
communication between the TTS engine and the tool being carried out through 
the PASR format. The capabilities of the visual interface tool are defined and 
determined by those of the TTS engine. The GUI editor 14 is an example of 
such a visual interface tool and is described next in detail. 

[0022] The GUI editor 14 is typically in a window form. The GUI editor 
14 can be organized or designed in multiple ways. Those skilled in the art will 
readily recognize that the GUI editor 14 shown here is merely an example and 
does not limit the invention in any way. The GUI editor 14 can display words 18 
and word boundaries 20. Each one of the words 18 can have independent 
display characteristics. For example, a word can be displayed at a greater height 
and with a smaller font to display visually the emphasis in pronunciation that has 
to be used when converting it to a speech form. The user 12 (see Fig. 1) thus 
can use the GUI editor 14 to fine-tune the text-to speech synthesis process in an 
interactive manner. 
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[0023] The GUI editor 14 operates independent of the language of the 
text. The language specific operations are carried out by the TTS engine. 
Hence, the same GUI editor can be used for different languages by just replacing 
or modifying the TTS engine for a particular language. 

[0024] The visual tuning approach of the present invention eliminates 
the need for the user 1 2 (see Fig. 1 ) to have any special training or experience in 
the speech synthesis process. The user 12 can interactively control the 
pronunciation of each word and the pauses between words, among other 
features of the speech, to be produced from the text. Further, the present 
invention eliminates the need for the user 12 to know or remember any specific 
tags or commands to control the speech synthesis process because all required 
speech parameters can be modified visually. Hence, a system that can be 
operated by any user without any special training can provide significant savings 
in cost of customizing a text-to-speech synthesis system. 

[0025] Typically, controls can be included in a control-box 20 where 
specific values for prominence, speed, pause and boundary can be entered and 
modified. While the user can always modify the words 18 using a pointing device, 
like a mouse or a track-ball, the control-box 20 provides an additional way to 
precisely enter values for speech parameters. Other functions like play 22 (to 
generate sound output) and save 24 (to save the sound output) can be included 
in the GUI editor 14. 

[0026] A user can control, edit and test multiple speech features or 
parameters that are represented in a graphical form using graphical indicators or 
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features of a GUI. For example the following features and parameters can be 
tuned or adjusted: normalized (expanded) text, part-of-speech assignment, 
parsing of the text, chunking of the text, boundary strength, pause duration, 
phonemic and/or allophonic transcription including stress and syllabification, 
speech rate, syllable or segment duration, pitch (default, minimum, maximum, 
actual contour), word prominence, or emphasis, formant mixing mode (linear or 
logarithmic), unit selection override, intensity contour, formant trajectories, and 
allophone rules (turned on or off). Those skilled in the art will appreciate that the 
above listed speech features are merely examples of the visually tunable 
features of speech and the same do not limit the present invention. 

[0027] Typically for each word in the text allophonic transcription 
(pronunciation), prominence (intonation), and speed (speech rate) can be 
customized by the user using the visual editing interface. Further, between-the- 
words parameters such as pause-length and prosodic boundary strength can be 
customized. Typically, the graphical editing interface can be designed to edit the 
speech features on a word level. However, there is no such requirement and 
editing can be performed at other levels. For example, at the allophonic level or 
even by using continuous envelope curves like Bezier curves. 

[0028] A variety of graphical indicators or features can be used to 
represent speech features listed above in the text output within the GUI editor 14. 
For example speech features can be represented using variations in font faces; 
coloring of text; vertical and horizontal spacing between words and individual 
letters of the words; styles such as italic, bold, underlined, blinking and crossing- 
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out; orientation of the text, rotation of text, punctuation etc. Any of these or other 
graphical indicators can be used either individually or in combination to 
potentially produce a large set of graphical indicators that can be associated with 
the speech features for displaying in the GUI editor 14. Those skilled in the art 
will appreciate that the above examples of graphical indicators are mere 
illustrations and hence do not limit the invention in any manner. 

[0029] Fig. 4 is a graphical representation of the design of the visual 
tuning system according to the principle of the present invention. Fig. 5 shows 
the relation between the design of the tuning system and the GUI editor 14. An 
example of the GUI editor 14's design is described next. CMarkupview class 26 
is the basic class for displaying the text in a graphical form. Another class 
CMarkupWindow 28 shows the window inside the CMarkupview class 26's 
overall display area. Classes CSynthesizer 30 and CMarkup Model 32 form the 
PASR text input to the CMarkupView 26 class. An interface IMarkupltem 34 
abstracts one PASR text item and is related to the CMarkupModel class 32, 
which holds the PASR output of the synthesized speech. 

[0030] The IMarkupltem interface 34 is related to a CMarkupltemWord 
class 36 that represents a single word 18 (see Fig. 2); while a 
CMarkupltemBoundary class 38 represents a PASR boundary, i.e., a word 
boundary. Classes CMarkupViewltemWord 40 and CMarkupViewltemBoundary 
42 refer to the CMarkupltemWord class 36 and the CMarkupltemBoundary class 
38 and render graphical representations of a word and boundary. Interface 
IMarkupViewltem 44 is the base interface to abstract one item for view that can 



11 



Attorney Docket No. 9432-000257 

be either a word or a boundary. CMarkupViewltemFactory class 46 is used to 
create multiple instances of view items like words and boundaries that are then 
supplied to the CMarkupWindow class 28. The design includes other supporting 
classes that are listeners for trapping and processing events in the visual 
classes. Those skilled in the art would appreciate that the above design is 
merely an example of structuring a visual tuning system according to the 
principle of the present invention. 

[0031] Fig. 4 shows the graphical view of the basic classes in the 
design of the visual tuning system. The CMarkupView class 26 is the overall 
view of the GUI editor 14 that performs the visual editing functions. The 
CMarkupWindow class 28 represents the main graphical region for displaying 
text with sound features represented as visual variations. 

[0032] In the visual tuning approach of the present invention, the user 
can easily experiment with different speech parameters in a graphical and 
intuitive manner and then select the best combination of speech parameters for a 
given application. The above listed speech parameters are just examples of 
various speech parameters that can be visually tuned. Hence, those skilled in 
the art will appreciate that the above examples of speech parameters do not limit 
the invention in any manner. 

[0033] Under the visual tuning approach of the present invention, the 
changes in the sound of the text to be converted into speech are psychologically 
related to the graphical properties of the text shown in the GUI editor 14. For 
example, the graphical length of the word is related to the duration of 
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pronunciation. The longer the graphical representation of a given word 18, the 
longer will be the sound of the word. The relative vertical position of a given 
word 18 represents the prominence of the word. In a similar manner, many 
other graphical properties can be associated with other speech parameters. 
Those skilled in the art will appreciate that the above examples of relating 
graphical properties of displayed text and the sound produced by such text are 
merely examples and hence do not the limit the present invention. 

[0034] The present invention can be incorporated in a software, 
hardware or combination of software and hardware forms. For example, the 
visual tuning interface can be designed as an ActiveX control. Further, two 
windows can be provided, where one window is used to enter the text and the 
other window functions as the GUI editor 14 (see Fig. 3). A client-server model 
can be also be used. For example, the GUI editor 14 can be run on a client like a 
cellular phone or a handheld PDA and the TTS engine can be executed on a 
server. Those skilled in the art will appreciate that the particular configuration of 
the GUI editor 14 can be adapted for any particular application, and the same 
does not limit the invention in any manner. 

[0035] The principle of the present invention can be applied to various 
application of the invention. For example, the visual tuning control according to 
the principle of the present invention can be used to customize a car-navigation 
system. In such a system, the GUI editor 14 (see Fig. 3) has a set of fixed text 
messages with blank slots that are editable. The user can enter text to be 
pronounced in the blank slots, but not modify the other fixed text. The user can 
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visually modify a limited number of text parameters to control the speech output, 
for example, the speed or pauses. Hence, a car-navigation system that uses 
speech prompt can be easily built and customized even by a user who is not 
trained in text-to-speech conversion process. 

[0036] The description of the invention is merely exemplary in nature 
and, thus, variations that do not depart from the gist of the invention are intended 
to be within the scope of the invention. Such variations are not to be regarded as 
a departure from the spirit and scope of the invention. 
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